Introduction
In the last post, I acquired the review data for one product on Amazon, converted it into a corpus, and built a word cloud. In this post I scrape reviews for several more products and preprocess the data.
Loading the libraries
Code
library(polite)
library(rvest)
library(tidyverse)
library(stringr)
library(quanteda)
library(tidyr)
library(RColorBrewer)
library(quanteda.textplots)
library(wordcloud)
library(wordcloud2)
library(devtools)
library(quanteda.dictionaries)
library(quanteda.sentiment)
Code
knitr::opts_chunk$set(echo = TRUE)
Code
scrape_amazon <- function(ASIN, page_num) {
  url_reviews <- paste0("https://www.amazon.com/product-reviews/", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews) # Assign results to `doc`
  # Review title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  # Return a tibble
  tibble(review_title, review_text, review_star, page = page_num, ASIN)
}
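To build a larger dataset, `scrape_amazon` is called once per results page for each ASIN, pausing between batches of requests so the scraper stays polite. The helper below is a sketch of that pattern; `collect_pages` is my own name for it, not a function from this post, and `fetch` stands in for a wrapper around `scrape_amazon` for one ASIN.

```r
library(dplyr)

# Call `fetch(i)` for pages 1..n_pages, pausing after every third
# request, then stack the per-page tibbles into one data frame.
collect_pages <- function(fetch, n_pages, pause = 2) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    pages[[i]] <- fetch(i)
    if (i %% 3 == 0) Sys.sleep(pause) # take a short break between batches
  }
  bind_rows(pages)
}

# e.g. collect_pages(function(i) scrape_amazon("B0001DBI1Q", i), 10)
```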
Code
reviews <- read_csv("amazonreview.csv")
New names:
• `` -> `...1`
Rows: 46450 Columns: 6
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr (4): review_title, review_text, review_star, ASIN
dbl (2): ...1, page
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
reviews
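The column-specification message can be silenced either with `show_col_types = FALSE` or by passing an explicit `col_types` spec. A minimal sketch, using a made-up inline CSV (via `I()`) rather than the real scraped file:

```r
library(readr)

# An explicit column spec silences the parsing message; I() lets
# read_csv() treat the string as literal CSV data.
demo <- read_csv(I("review_star,page\n5.0 out of 5 stars,1\n4.0 out of 5 stars,2\n"),
                 col_types = cols(review_star = col_character(),
                                  page = col_double()))
demo
```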
Code
clean_text <- function(text) {
  # Remove URLs
  str_remove_all(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>%
    # Replace "&" character reference with "and"
    str_replace_all("&", "and") %>%
    # Remove punctuation, using a standard character class
    str_remove_all("[[:punct:]]") %>%
    # Remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("\\\n|\\\r", " ") %>%
    # Remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Remove leading/trailing white space and repeated spaces inside a string
    str_squish()
}
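Walking a made-up review string through the same steps shows what the pipeline does: mentions, punctuation, and digits drop out, "&" becomes "and", and the result is lower-cased and squished.

```r
library(stringr)

# Apply the same transformations as clean_text(), step by step
x <- "Loved it!!! 5 stars & more @user"
x <- str_remove_all(x, "@[[:alnum:]_]*") # drop mentions
x <- str_replace_all(x, "&", "and")      # & -> and
x <- str_remove_all(x, "[[:punct:]]")    # drop punctuation
x <- str_remove_all(x, "[[:digit:]]")    # drop digits
x <- str_to_lower(x)
str_squish(x)
#> [1] "loved it stars and more"
```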
Code
reviews$clean_text <- clean_text(reviews$review_text)
reviews <- reviews %>% drop_na(clean_text)
reviews

Code
text <- corpus(c(reviews$clean_text))
text <- dfm(tokens(text, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove"))
text

Code
text_dfm <- dfm_trim(text, min_termfreq = 50, docfreq_type = "prop")
# create fcm from dfm
text_fcm <- fcm(text_dfm)
text_fcm

Feature co-occurrence matrix of: 3,795 by 3,795 features.
features
features love fantasy ever since kid stories set creative worlds groups
love 13025 2234 3670 2395 337 1701 1155 224 161 72
fantasy 0 1779 884 584 45 554 456 76 99 20
ever 0 0 722 758 90 408 347 64 55 17
since 0 0 0 455 60 271 309 36 28 15
kid 0 0 0 0 48 23 27 6 5 5
stories 0 0 0 0 0 313 190 28 33 33
set 0 0 0 0 0 0 237 26 29 11
creative 0 0 0 0 0 0 0 22 3 2
worlds 0 0 0 0 0 0 0 0 8 4
groups 0 0 0 0 0 0 0 0 0 2
[ reached max_feat ... 3,785 more features, reached max_nfeat ... 3,785 more features ]
Code
# pull the top features
top_features <- names(topfeatures(text_fcm, 50))
# retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# check dimensions
dim(even_text_fcm)
[1] 50 50
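The same two calls can be checked on a toy co-occurrence matrix; the three short "documents" below are invented for illustration.

```r
library(quanteda)

# Build a small fcm, take the 2 most frequent features,
# and keep only those rows/columns of the matrix.
toy <- fcm(tokens(c("good book good story",
                    "great book good plot",
                    "great story")))
top2 <- names(topfeatures(toy, 2))
fcm_select(toy, pattern = top2, selection = "keep")
```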
Code
# compute size weight for vertices in network
size <- log(colSums(even_text_fcm))
# create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
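The log-then-normalize weighting compresses the huge frequency range so rare words stay visible: the most frequent feature gets vertex size 2 and everything else scales down slowly. Reproducing it on three counts taken from the co-occurrence table above:

```r
# Diagonal counts for "love", "fantasy", and "groups" from the fcm output
counts <- c(love = 13025, fantasy = 1779, groups = 2)
size <- log(counts)
# "love" maps to 2; "groups", 6500x rarer, still gets a visible 0.15
round(size / max(size) * 2, 2)
```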
Code
textplot_wordcloud(text, min_count = 100, max_words = 200, random_order = TRUE)

Further study
Next I will add more reviews, continue the pre-processing, plot some exploratory visualizations, and, if possible, run a sentiment analysis.
Blog Post 3 · Mani Shanker Kamarapu · 11/5/2022